COVID-19: Modeling The Relative Impact on US States

By Miguel Chavez

Introduction

Just days after the first SARS-CoV-2 cases were identified in the United States, it became apparent that different states were being affected to very different degrees. The hospital ship USNS Comfort arrived at Pier 88 of New York Harbor on Monday, March 30, while life in much of rural America remained largely unaffected. This disparity of impact continued throughout the summer and into the fall, as state and local governments took an array of different approaches to balancing the public health impact of COVID-19 against the economic impact.

In the following analysis, I will examine the varying degrees of impact COVID-19 has had on six states:

  • North Carolina
  • Maryland
  • California
  • New York
  • Texas
  • Michigan

These states represent a variety of demographics and geographies. By analyzing the death tolls of these states, as well as attributes such as geography and population density, I hope to identify causal relationships that will be useful in building a predictive model.

Environment Setup

Pandas and GeoPandas will be used for data manipulation, Bokeh for interactive plotting, Folium for interactive maps, and scikit-learn for ML model creation.

In [3]:
#!pip install jupyterthemes folium geopandas
!jt -t gruvboxd -T -N -cellw 70%
from IPython.core.display import display, HTML 
display(HTML("<style>.container { width:80% !important; margin-left:15% !important; }</style>"))
In [4]:
import folium
import requests
import pandas as pd
import geojson
import json
import numpy as np
import geopandas as gpd
from datetime import datetime
from sklearn import preprocessing
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh.plotting as bkplt
from bokeh.io import output_notebook, show
from bokeh.resources import INLINE
from bokeh.layouts import gridplot as grid, row, column
from bokeh.models import LinearAxis, Range1d
import censusdata
from bokeh.themes import built_in_themes, Theme
from bokeh.io import curdoc
from bokeh import palettes
#curdoc().theme = 'dark_minimal'
plt.style.use('fivethirtyeight')
sns.set_context('poster')
bkplt.output_notebook()
colors = palettes.Set3[6]
states = ['North Carolina', 'Maryland', 'California', 'New York', 'Texas', 'Michigan']
populations = [10385000, 6045000, 39500000, 19450000, 29000000, 10000000]
scaler = preprocessing.MinMaxScaler(feature_range=(0, .9))
with open('customTheme.json') as f:
    d = json.load(f)
    curdoc().theme = Theme(json = d)

Data Collection and Management

Data Sources

COVID-19 data is sourced from the Novel Corona Virus 2019 Dataset on Kaggle, county-level GeoJSON files come from Opendatasoft, and US census data is pulled via the CensusData API.

Reading and Storing Data

First, using pandas, we read the COVID-19 dataset into a DataFrame, pandas' own two-dimensional data structure.

In [5]:
cov19df = pd.read_csv("time_series_covid_19_deaths_US.csv")
cov19df.iloc[-2:]
Out[5]:
UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat Long_ ... 11/27/20 11/28/20 11/29/20 11/30/20 12/1/20 12/2/20 12/3/20 12/4/20 12/5/20 12/6/20
3338 84056043 US USA 840 56043.0 Washakie Wyoming US 43.904516 -107.680187 ... 8 8 8 8 8 8 8 8 8 8
3339 84056045 US USA 840 56045.0 Weston Wyoming US 43.839612 -104.567488 ... 1 1 1 1 2 2 2 2 2 2

2 rows × 332 columns

Next, using GeoPandas, we read the GeoJSON files downloaded from Opendatasoft into GeoDataFrames. A GeoDataFrame is nothing but a pandas DataFrame with a defined geometry column, but the GeoPandas library makes converting a GeoJSON file into a DataFrame easy. You may notice the line:

mdGdf.at[15, 'name'] = 'Baltimore City'

This is an example of data pre-processing: Baltimore City and Baltimore County were both marked as 'Baltimore' in the name column of our GeoDataFrame, so later merges with the COVID-19 dataset (which contains both 'Baltimore' and 'Baltimore City') would be unsuccessful or incomplete if this were not resolved. Manually fixing data can be a painstaking yet necessary part of data science.

In [6]:
nc_geo = "ncCounties.geojson"
tx_geo = "txCounties.geojson"
ny_geo = "nyCounties.geojson"
ca_geo = "caCounties.geojson"
md_geo = 'mdCounties.geojson'
mi_geo = 'miCounties.geojson' 
ncGdf = gpd.read_file(nc_geo)
txGdf = gpd.read_file(tx_geo)
nyGdf = gpd.read_file(ny_geo)
caGdf = gpd.read_file(ca_geo)
mdGdf = gpd.read_file(md_geo)
miGdf = gpd.read_file(mi_geo)
mdGdf[mdGdf['name'] == 'Baltimore']
Out[6]:
intptlat countyfp_nozero name cbsafp funcstat intptlon lsad stusab classfp awater ... geoid aland countyfp countyns mtfcc namelsad statefp state_name metdivfp geometry
7 +39.4431666 5 Baltimore 12580 A -076.6165693 06 MD H1 215959023 ... 24005 1549745106 005 01695314 G4020 Baltimore County 24 Maryland None POLYGON ((-76.88730 39.44050, -76.88732 39.440...
15 +39.3000324 510 Baltimore 12580 F -076.6104761 25 MD C7 28758714 ... 24510 209650970 510 01702381 G4020 Baltimore city 24 Maryland None POLYGON ((-76.71151 39.36621, -76.71151 39.366...

2 rows × 21 columns

In [7]:
mdGdf.at[15, 'name'] = 'Baltimore City'
mdGdf[mdGdf['name'] == 'Baltimore']
Out[7]:
intptlat countyfp_nozero name cbsafp funcstat intptlon lsad stusab classfp awater ... geoid aland countyfp countyns mtfcc namelsad statefp state_name metdivfp geometry
7 +39.4431666 5 Baltimore 12580 A -076.6165693 06 MD H1 215959023 ... 24005 1549745106 005 01695314 G4020 Baltimore County 24 Maryland None POLYGON ((-76.88730 39.44050, -76.88732 39.440...

1 rows × 21 columns
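Collisions like the Baltimore one can be surfaced ahead of a merge by checking for duplicated keys. A minimal sketch on a toy stand-in frame (hypothetical data, not the real GeoDataFrame):

```python
import pandas as pd

# Toy stand-in for a GeoDataFrame's attribute table (hypothetical rows)
gdf = pd.DataFrame({
    "name": ["Baltimore", "Baltimore", "Howard"],
    "namelsad": ["Baltimore County", "Baltimore city", "Howard County"],
})

# Rows sharing a 'name' value will fan out on a later merge keyed on 'name'
collisions = gdf[gdf.duplicated(subset="name", keep=False)]
print(collisions["namelsad"].tolist())  # → ['Baltimore County', 'Baltimore city']
```

Running a check like this for every state up front would catch any other ambiguous county names before they silently corrupt a join.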

Last, we will use the CensusData API to collect population figures for each county in our selected states; these numbers will also be stored in a DataFrame. To use the CensusData API we need the ANSI FIPS codes for our states, which are conveniently located in the GeoDataFrames we just produced.

In [8]:
def censusData(stateCode):
    censusDf = censusdata.download('acs5', 2015, censusdata.censusgeo(
        [('state', stateCode),('county', '*')]), ['B01001_001E'])
    censusDf = censusDf.reset_index()
    censusDf['name'] = censusDf.apply(lambda x: str(x['index']).split(" Count")[0] , axis = 1)
    censusDf = censusDf.rename(columns = {'B01001_001E': 'pop'}).drop(['index'], axis = 1)
    return censusDf 
ncCensus = censusData(ncGdf.iloc[0]['statefp'])
mdCensus = censusData(mdGdf.iloc[0]['statefp'])
txCensus = censusData(txGdf.iloc[0]['statefp'])
nyCensus = censusData(nyGdf.iloc[0]['statefp'])
caCensus = censusData(caGdf.iloc[0]['statefp'])
miCensus = censusData(miGdf.iloc[0]['statefp'])
ncCensus.iloc[-2:]
Out[8]:
pop name
98 37971 Yadkin
99 17604 Yancey

To make things easier, let's gather our DataFrames in a single data structure.

In [9]:
frames = {"Maryland":{"census": mdCensus, "gdf": mdGdf},
         "North Carolina":{"census": ncCensus, "gdf": ncGdf},
         "Michigan":{"census": miCensus, "gdf": miGdf},
         "Texas":{"census": txCensus, "gdf": txGdf},
         "New York":{"census": nyCensus, "gdf": nyGdf},
         "California":{"census": caCensus, "gdf": caGdf}}

Now we have our geographic and census DataFrames stored together, but our COVID-19 dataset is still in an unusable form. Before we separate it into our different states, we need to make it tidy. Currently the DataFrame has one row for every county and a column for every day (along with other attribute columns); to make this data tidy, every observation (each day, for each county) must become its own row. To do this we use melt, a pandas function that unpivots a DataFrame. We do this for every state and then add these frames to our dictionary of frames.

In [10]:
for f in frames:
    #select the state's covid data
    state_covid_data = cov19df[cov19df['Province_State'] == f]
    # using melt
    state_covid_data = state_covid_data.rename(columns = {"Admin2": "name"})
    state_covid_data = pd.melt(state_covid_data,id_vars='name', value_vars = list(state_covid_data.columns[12:332].values),
        var_name = 'date', value_name = 'deaths')
    frames[f]['covid'] = state_covid_data
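The reshaping above is easier to see on a toy frame: the wide date columns become rows of (name, date, deaths) observations. A sketch with made-up numbers:

```python
import pandas as pd

# Tiny wide-format frame: one row per county, one column per date (toy data)
wide = pd.DataFrame({
    "name": ["Wake", "Durham"],
    "1/22/20": [0, 0],
    "1/23/20": [1, 0],
})

# melt unpivots the date columns: one row per (county, day) observation
tidy = pd.melt(wide, id_vars="name", var_name="date", value_name="deaths")
print(len(tidy))  # → 4
```

Two counties times two days yields four tidy rows, each carrying exactly one observation.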

Now that we have all of our data, we could get straight to plotting and analysis, but to make our lives easier we should join these frames. We will use merge, the pandas function analogous to SQL joins. Since every row will now carry its state's data as well, we can drop the dictionary holding our frames and simply append them together.

In [11]:
df = pd.DataFrame()
for f in frames:
    df = df.append(frames[f]['covid'].merge(frames[f]['census'], on = 'name', how = 'inner').merge(frames[f]['gdf'], on = 'name', how = 'inner'))
df.iloc[-2:]
Out[11]:
name date deaths pop intptlat countyfp_nozero cbsafp funcstat intptlon lsad ... geoid aland countyfp countyns mtfcc namelsad statefp state_name metdivfp geometry
18558 Yuba 12/5/20 11 73437 +39.2701300 115 49700 A -121.3442587 06 ... 06115 1636913845 115 00277322 G4020 Yuba County 06 California None POLYGON ((-121.59768 39.12779, -121.59781 39.1...
18559 Yuba 12/6/20 11 73437 +39.2701300 115 49700 A -121.3442587 06 ... 06115 1636913845 115 00277322 G4020 Yuba County 06 California None POLYGON ((-121.59768 39.12779, -121.59781 39.1...

2 rows × 24 columns
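When merge keys may not line up perfectly (as with the Baltimore fix earlier), pandas' `indicator` flag on merge is a cheap completeness check. A sketch on toy frames (hypothetical rows):

```python
import pandas as pd

left = pd.DataFrame({"name": ["Baltimore", "Baltimore City"], "deaths": [5, 9]})
right = pd.DataFrame({"name": ["Baltimore", "Howard"], "pop": [800000, 320000]})

# how='outer' plus indicator=True exposes keys present on only one side
check = left.merge(right, on="name", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "name"].tolist()
print(sorted(unmatched))  # → ['Baltimore City', 'Howard']
```

An inner join would silently drop those unmatched rows, so auditing with an outer join first is a useful habit.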

Date is currently represented as a string, which isn't optimal, so let's convert it to datetime now.

In [12]:
df['datetime'] = df.apply(lambda x: datetime.strptime(x['date'], '%m/%d/%y'), axis = 1)
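The row-wise strptime above works; pandas can also parse the whole column at once with `to_datetime`, which gives the same result without per-row Python calls. A sketch using the same format string:

```python
import pandas as pd

# Same '%m/%d/%y' format as the dataset's date columns (toy values)
dates = pd.Series(["3/2/20", "12/6/20"])
parsed = pd.to_datetime(dates, format="%m/%d/%y")
print(parsed.dt.year.tolist())  # → [2020, 2020]
```

On a frame with ~185,000 rows like ours, the vectorized parse is noticeably faster than apply.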

Since population density will be of use to us later, let's go ahead and calculate it now. The column 'aland' is each county's land area in $m^{2}$, originally obtained from our GeoJSON files. I'll be using $km^{2}$, so you'll notice a factor of $1000^{2} = 1000000$ coming into play. We also want state death totals for each day, so I'll compute those as well using a groupby operation, which, like merge, is analogous to the SQL statement of the same name.

In [13]:
dfTotals = df.groupby(by = ['datetime', 'state_name']).sum().reset_index()
df = df.reset_index(drop = True)
df['pDensity'] = df.apply(lambda x: (x['pop']/x['aland'])*1000000, axis = 1)
dfTotals['pDensity'] = dfTotals.apply(lambda x: (x['pop']/x['aland'])*1000000, axis = 1)
display(df.iloc[-2:], dfTotals.iloc[-2:])
name date deaths pop intptlat countyfp_nozero cbsafp funcstat intptlon lsad ... countyfp countyns mtfcc namelsad statefp state_name metdivfp geometry datetime pDensity
185598 Yuba 12/5/20 11 73437 +39.2701300 115 49700 A -121.3442587 06 ... 115 00277322 G4020 Yuba County 06 California None POLYGON ((-121.59768 39.12779, -121.59781 39.1... 2020-12-05 44.863082
185599 Yuba 12/6/20 11 73437 +39.2701300 115 49700 A -121.3442587 06 ... 115 00277322 G4020 Yuba County 06 California None POLYGON ((-121.59768 39.12779, -121.59781 39.1... 2020-12-06 44.863082

2 rows × 26 columns

datetime state_name deaths pop awater aland pDensity
1918 2020-12-06 North Carolina 5543 9845333 13463401534 125925929633 78.183524
1919 2020-12-06 Texas 23137 26538614 18991880422 676668210823 39.219537

Now that we have county- and state-level DataFrames of tidy data, we can move on to analysis.

Exploratory Data Analysis

State Level Visualization

I'll primarily be using Bokeh for visualization, since Bokeh plots remain interactive once exported to HTML, which matplotlib/seaborn plots do not, at least not in an easy-to-apply fashion. To get a feel for some trends visually, I'll plot three things:

  1. $$\textrm{Cumulative Deaths: } \sum_{i = 3/02/20}^{12/6/20} deaths_{i}$$
  2. $$\textrm{Cumulative Deaths Per Capita: }\sum_{i = 3/02/20}^{12/6/20} \frac{deaths_{i}}{pop}$$
  3. $$\textrm{Cumulative Deaths Per Population Density $km^{2}$: }\sum_{i = 3/02/20}^{12/6/20} \frac{deaths_{i}}{pop/km^{2}} \equiv \sum_{i = 3/02/20}^{12/6/20} \frac{deaths_{i}\cdot km^2}{pop} $$

To plot efficiently, I'll zip two lists: one of the names of my states, and another of my chosen colors from the palette I defined during environment setup.

In [14]:
states_colors = zip(states, colors)
f1 = bkplt.figure(plot_width = 1200, plot_height = 600, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths',
    title ="Cumulative Deaths for Selected States", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']),sizing_mode="stretch_both")
f2 = bkplt.figure(plot_width = 1200, plot_height = 600, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Capita',
    title ="Cumulative Deaths Per Capita for Selected States", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
f3 = bkplt.figure(plot_width = 1200, plot_height = 600, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Population Density',
    title ="Cumulative Deaths Per Population Density for Selected States", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
for state,color in states_colors:
    df1 = dfTotals[dfTotals['state_name'] == state].copy()
    df1['var2'] = df1.apply(lambda x: x['deaths']/x['pop'], axis = 1)
    df1['var3'] = df1.apply(lambda x: x['deaths']/x['pDensity'], axis = 1)
    f1.line(df1['datetime'], df1['deaths'], line_width = 6, alpha = .8, color = color, legend_label= state,)
    f2.line(df1['datetime'], df1['var2'], line_width = 6, alpha = .8, color = color, legend_label= state,)
    f3.line(df1['datetime'], df1['var3'], line_width = 6, alpha = .8, color = color,legend_label= state)
for f in [f1,f2,f3]:
    f.legend.location = "top_left"
    f.legend.title = "Click To Hide"
    f.legend.title_text_color = "White"
    f.legend.click_policy="hide"
show(column(f1,f2,f3, sizing_mode = 'stretch_width'))

Initial Observations

The extremely fast rise of deaths in New York immediately pops out, on both an absolute and a per capita basis. Viewing certain pairs of states also serves for interesting comparisons. Try hiding every state besides California, Texas, and Michigan in the per capita plot. We see an initial surge from Michigan, while Texas and California remain tightly paired. As the weather heated up, though, deaths in Texas rose dramatically. I would guess this is due to more people congregating inside during the summer months, compared to the more comfortable summers of CA and MI. The trend reverses as the weather cools off, and Michigan deaths begin to rise dramatically. I would assume this is a similar phenomenon: as the weather becomes unbearable, outdoor gatherings dwindle, and more people are bound to catch COVID-19 inside.

When controlling for population density in the third graph, the problem with comparing states to each other becomes apparent. Texas deaths skyrocket, but this is almost certainly due to the many largely uninhabited counties with extremely large land areas, which skew the state's population density far below the density the average Texan actually lives at. It is therefore clearly necessary to do further exploratory data analysis at the county level before attempting to build and train a model.
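One common remedy for this skew is population-weighted density, which weights each county's density by the share of residents who actually experience it. A sketch with two hypothetical counties (one dense urban county, one vast and nearly empty):

```python
import pandas as pd

# Hypothetical counties: a dense urban one and a vast, sparsely populated one
counties = pd.DataFrame({
    "pop":       [2300000, 1500],
    "aland_km2": [2300,    10000],
})
counties["density"] = counties["pop"] / counties["aland_km2"]

# Simple (area-implied) state density vs. population-weighted density
simple = counties["pop"].sum() / counties["aland_km2"].sum()
weighted = (counties["pop"] * counties["density"]).sum() / counties["pop"].sum()
print(round(simple, 1), round(weighted, 1))  # → 187.1 999.3
```

The empty county drags the simple figure down to 187 people per $km^{2}$, while the weighted figure stays near the 1000 per $km^{2}$ that almost everyone actually lives at, which is exactly the distortion the Texas curve exhibits.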

County Level Visualization

To get a better understanding of what's going on at the county level, I'll make two plots for each state: one displaying deaths per capita for each county, and one displaying deaths while controlling for population density, as in the last two statewide plots.

In [15]:
plots = []
states_colors = zip(states, colors)
for state, color in states_colors:
    f2 = bkplt.figure(plot_width = 800, plot_height = 500, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Capita',
        title ="Cumulative Deaths Per Capita for Counties in "+ state, x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
    f3 = bkplt.figure(plot_width = 800, plot_height = 500, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Population Density',
        title ="Cumulative Deaths Per Population Density for Counties in " + state, x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
    df1 = df[df['state_name'] == state].copy()
    df1['var2'] = df1.apply(lambda x: x['deaths']/x['pop'], axis = 1)
    df1['var3'] = df1.apply(lambda x: x['deaths']/x['pDensity'], axis = 1)
    df1 = df1.sort_values(by = 'datetime')
    county_count = len(df1['name'].unique())
    top5 = list(df1[-county_count:].sort_values(by = 'deaths')['name'][-5:])
    top5.reverse()
    for s in df1['name'].unique():
        alpha = .1
        line_width = 2
        df2 = df1[df1['name'] == s].copy()
        if( s in top5 ):
            f2.line(df2['datetime'], df2['var2'], alpha = .5, color = color, line_width = 3, muted_alpha=1, legend_label= s, muted_line_width = 6)
            f3.line(df2['datetime'], df2['var3'], alpha = .5, color = color, line_width = 3, muted_alpha=1, legend_label= s, muted_line_width = 6)
            continue
        f2.line(df2['datetime'], df2['var2'], alpha = .1, color = color, line_width = 2)
        f3.line(df2['datetime'], df2['var3'], alpha = .1, color = color, line_width = 2)
    for f in [f2,f3]:
        f.legend.location = "top_left"
        f.legend.title = "Click To Hide"
        f.legend.title_text_color = "White"
        f.legend.click_policy="mute"
        f.legend.title = "Top 5 "+state+" Counties (by deaths): Click To Highlight"
    plots.append([f2,f3])
show(grid(plots, sizing_mode = 'stretch_width'))